Introduction to R

Short Course on Statistical Programming in R

Smith School of Business - Session 1 (October 6, 2017)





INSTRUCTOR
Eric Dunford | Ph.D. Candidate | GVPT
edunford@umd.edu

Overview

Today we'll cover:

  • Understanding R & R Studio
  • Objects/Data Structures
  • Packages
  • Data Basics (importing/exporting data)
  • Operations
  • Cleaning Text

The main goal is to get familiar with the R environment.

R in a Nutshell

R is a statistical and graphical programming language based on a much older language called S. Its source code is written in C, Fortran, and R, and it's completely free under a GNU General Public License.

What this means for us:

  • No Barriers to Entry: easy to acquire, easy to contribute
  • Active Community: if you can think it, there is likely a package out there that does it.
  • Powerful and Adaptive: build an estimator from scratch, scrape a website, automate the coding of a dataset. All is within one's reach.

Why use R?

R offers a powerful way to

  • analyze data
  • clean Excel spreadsheets
  • migrate projects across platforms
  • format and clean text
  • manage any data source
  • produce compelling graphics and maps

R Studio

R Studio is a graphical user interface (GUI) for the R programming language. The software makes R more user-friendly by adding some point-and-click functionality along with a complete integration of graphs, the data environment, and the coding script.

Think of it like this: R is the engine that runs all our commands, and R Studio is the leather seats and steering wheel. One does the work, the other eases how that work is done.

Installing R and R Studio

To install R, download R from CRAN via the following:

To install R Studio, download from the following:

Getting Familiar with R Studio

R Studio is broken up into 4 quadrants that can be arranged and customized to the user's preference.

These quadrants are broken up as follows…

The Console

The console is where all the action happens. This is “R”.

The Console

All commands are processed through the console directly (that is, one can type commands directly into it) or via a script.

Scripts

A script is a .R text file where we write and run our code.

Scripts

When we write a line of code, we can run it in the console by highlighting the text and…

  • clicking run
  • pressing command + enter (mac)
  • pressing control + enter (windows)

Scripts

Everything in a script will be treated as code – that is if you run it, the line will be processed through the console.

However, we can leave comments and notes to ourselves by commenting out sections of the script using a #
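
As a small illustration of commenting (the object name total is made up for this sketch), anything after a # is skipped when the line is run:

```r
# A comment: R skips everything after the # on this line
total <- 2 + 3   # a comment can also follow code on the same line
total
```

Running all three lines stores and prints the sum; the comment text never reaches the console as code.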

Objects

Objects

R uses a specific set of rules to govern how it looks up values in the environment.

We manage data by assigning it a name, and referencing that name when we need to use the information again.

Officially, these lookup rules are called lexical scoping. The "lexical" comes from the computer science term “lexing”, which is part of the process that converts code written as text into meaningful pieces that the programming language understands.

Assigning an Object

In simple terms, an object is a bit of text that represents a specific value.

x <- 3
x
[1] 3

Here we've assigned the value 3 to the letter x. Whenever we type x, R understands that we really mean 3.

Assigning an Object

There are three standard assignment operators:

  • <-
  • =
  • assign()

“Best practice” is to use the <- assignment operator.

x1 <- 3 
x2 = 3
assign("x3",3)
c(x1, x2, x3)
[1] 3 3 3

Assigning an Object

Note that assignment is flexible. An object can be written and re-written when necessary, even with a value of a different class.

object <- 5
object
[1] 5
object <- "A Very Vibrant Shade of Purple"
object
[1] "A Very Vibrant Shade of Purple"

Down the road it will help to give objects meaningful names!

Objects

One can see all the objects in the environment by either looking at the user interface in RStudio (specifically, the environment tab)… or by typing ls() in the console.

ls()
[1] "object" "x"      "x1"     "x2"     "x3"    

Object Classes

Once assigned, an object has a class. A class describes the properties of the data type or data structure assigned to an object.

We can use the function class() to find out what kind of data type or structure our object is.

class(x)
[1] "numeric"

The object x is of class numeric, i.e. a number.

Object Classes

There are many classes that an object can take.

obj1 <- "This is a sentence"
obj2 <- TRUE
obj3 <- factor("This is a sentence")
c(class(obj1),class(obj2),class(obj3))
[1] "character" "logical"   "factor"   

Understanding what class of object one is dealing with is important — as it will determine what kind of manipulations one can do or what functions an object will work with.

Object Classes

As noted, there are many different data types in R. We will primarily run into the following types:

Type        Example
Integer     7L
Numeric     4.56
Character   "Hello!"
Logical     TRUE
Factor      factor("cat")  (stored internally as the integer code 1)
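
A quick way to check these types is to pass a value of each kind to class(). This is only a sketch; note that a plain 7 is stored as numeric, and the L suffix is what requests an integer.

```r
class(7L)             # "integer": the L suffix asks for an integer
class(7)              # "numeric": plain numbers are stored as doubles
class(4.56)           # "numeric"
class("Hello!")       # "character"
class(TRUE)           # "logical"
class(factor("cat"))  # "factor"
```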

Object Coercion

When need be, an object can be coerced to be a different class.

x
[1] 3
as.character(x)
[1] "3"

Here we transformed x – which was an object containing the value 3 – into a character. The result is a string with the text “3”; x itself stays numeric unless we assign the result back to it.
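
The sketch below makes the same point with a second object (x_chr is an illustrative name, not part of the course script): coercion returns a new value, and we can coerce back when we need to do arithmetic.

```r
x <- 3
x_chr <- as.character(x)   # store the coerced value under a new name
class(x)                   # "numeric": x itself is unchanged
class(x_chr)               # "character"
as.numeric(x_chr) + 1      # coerce back before doing arithmetic: 4
```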

Removing objects from the Environment

We often want to get rid of objects after creating them. To delete (or drop) an object from the environment, use the function rm() – which stands for “remove”.

ls()
[1] "obj1"   "obj2"   "obj3"   "object" "x"      "x1"     "x2"     "x3"    
rm(x,x1,x2,x3)
ls()
[1] "obj1"   "obj2"   "obj3"   "object"

Clearing the Environment

We can also remove all objects from the environment at once by typing the following command.

rm(list = ls(all.names = TRUE))

Or we can do so from R Studio by clicking on the broom icon.

Objects: So what's the point?

Objects offer a way to reference different data. This means that we can play around with a lot of different data types simultaneously.

This makes it easier to:

  • manage and use multiple datasets at the same time
  • extract and manipulate single variables
  • work with little bits of data at a time to make sure your calculations work.

Note that the only way to hold onto information is to assign it as an object! Otherwise the information is printed but instantly forgotten by R.
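
A minimal sketch of the difference (root is a made-up object name): the first line prints a result that is then lost, while the second keeps it for later use.

```r
sqrt(16)            # result is printed, then discarded
root <- sqrt(16)    # result is stored under the name root
root * 2            # the stored value can be reused: 8
```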

Functions

What are functions?

A function is a type of object in R that can perform a specific task. Unlike objects that hold data, functions take arguments and return the output of some manipulation.

A function is specified first with the object name and then parentheses. For example, the function log() calculates the natural log of any number placed inside the parentheses.

log(4)
[1] 1.386294

Where are functions exactly?

Functions operate in the background.

There are a number of functions in R, known as base functions, that are always running when you turn R on.

When we need to do things that are not a part of the base functionality, we can import new functions by installing packages. But more on this later.

Some common functions

We've already come across a few functions, and we'll learn a lot more moving forward. Just keep in mind that whenever a name is followed by parentheses (), it's a function call.

Here are examples of a few common base functions that we'll see.

Function          Description
c()               links entries together as a vector
as.character()    coerces the input to be a character class
length()          reports how “long” a vector or data frame is
dim()             reports the dimensions of a data frame
class()           reports the class of an object

Figuring out what a function does...

All functions in R contain rich documentation regarding how a function works, the inputs it requires, and example code. We can access this documentation by using ? in front of the function.

?c()

Data Structures

Data Structures

There are also many ways data can be organized in R.

The same object can be organized in different ways depending on the needs of the user. Some commonly used data structures include:

  • vector
  • matrix
  • data.frame
  • list
  • array

Data Structures: Vector

X <- c(1, 2, 4, 5, 44, 6, 10)
X
[1]  1  2  4  5 44  6 10
class(X)
[1] "numeric"
length(X)
[1] 7

Data Structures: Data Frame

data.frame(X)
   X
1  1
2  2
3  4
4  5
5 44
6  6
7 10

Data Structures: Matrix

matrix(X)
     [,1]
[1,]    1
[2,]    2
[3,]    4
[4,]    5
[5,]   44
[6,]    6
[7,]   10

Data Structures: List

list(X)
[[1]]
[1]  1  2  4  5 44  6 10

Data Structures: Array

array(X,dim = c(2,2,2))
, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5

, , 2

     [,1] [,2]
[1,]   44   10
[2,]    6    1

The point...

We need to understand the structure of a data object to understand how to access the information inside.

There are many ways to organize the same piece of information in R, and different data structures afford us different advantages and bring with them different limitations.
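
To see why the structure matters, here is a sketch that pulls the same value (the fifth entry, 44) out of the vector X after it has been wrapped in each structure. The indexing syntax differs in every case.

```r
X <- c(1, 2, 4, 5, 44, 6, 10)

X[5]                  # vector: a single index
matrix(X)[5, 1]       # matrix: row, then column
list(X)[[1]][5]       # list: [[ ]] extracts the element, then index into it
data.frame(X)$X[5]    # data frame: pick the column by name, then index
```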

Throughout this short course, data frames will be the dominant data structure that we use; however, as you become more acquainted with R, you'll see and use other types of data structures more often.

Accessing Data

Accessing a Data Object

One must understand the structure of an object in order to systematically access the material contained within it.

Let's use a dataset inherent to R called cars. There are a number of datasets that are built into R. These are for demonstration purposes.

Note that these data will not appear in the environment until we assign them to an object.

data <- cars
class(data)
[1] "data.frame"

Accessing a Data structure

An easy way to see what's inside a data object is to just print() it. R prints objects automatically in the console.

data
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17
11    11   28
12    12   14
13    12   20
14    12   24
15    12   28
16    13   26
17    13   34
18    13   34
19    13   46
20    14   26
21    14   36
22    14   60
23    14   80
24    15   20
25    15   26
26    15   54
27    16   32
28    16   40
29    17   32
30    17   40
31    17   50
32    18   42
33    18   56
34    18   76
35    18   84
36    19   36
37    19   46
38    19   68
39    20   32
40    20   48
41    20   52
42    20   56
43    20   64
44    22   66
45    23   54
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85

Accessing a Data structure

We can look at the structure of a data object by using the str() function.

str(data)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

Or grab the variable names using the colnames() function.

colnames(data)
[1] "speed" "dist" 

Accessing a Data structure

We can leverage what we know about the dimensionality of the data to extract parts of it.

We do this by using brackets [] alongside the data object. We then can access the dimensions in the data by specifying the row and column.

data[row,column]

Accessing a Data structure

The function dim() can tell us about the dimensions of a data object.

dim(data)
[1] 50  2

We now know that the object data has 50 rows and 2 columns.

Accessing a Data structure

data[,2] # Access the entire 2nd column
 [1]   2  10   4  22  16  10  18  26  34  17  28  14  20  24  28  26  34
[18]  34  46  26  36  60  80  20  26  54  32  40  32  40  50  42  56  76
[35]  84  36  46  68  32  48  52  56  64  66  54  70  92  93 120  85
data[49,] # Access just the 49th row
   speed dist
49    24  120

Accessing a Data structure

data[1,2] # Access just a cell
[1] 2

The key is to keep in mind the dimensions. We can't access data that isn't there.

data[51,]
   speed dist
NA    NA   NA

Accessing a Data structure

Most data objects can be accessed using the $ operator.

$ acts as a key by which we can extract a specific variable or data feature.

If we hit Tab after specifying the $ after our data object, R Studio will offer a list of all available variables.

Accessing a Data structure

Here we call the speed variable from our dataset.

data$speed
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
[24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
[47] 24 24 24 25
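
As a sketch of how $ relates to the bracket method, the following checks that three ways of asking for the speed column return the same vector, and then feeds the column to a function.

```r
data <- cars

identical(data$speed, data[["speed"]])   # TRUE: [[ ]] with the column name
identical(data$speed, data[, "speed"])   # TRUE: brackets with the column name
mean(data$speed)                         # 15.4
```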

There are many functions designed to help us understand the dimensions of a data structure.

dim(data) # Dimensions 
[1] 50  2
nrow(data) # Number of Rows
[1] 50
ncol(data) # Number of Columns
[1] 2

There are also some useful functions built into R to view portions of a data structure.

head(data,3) # Reports the 3 first entries 
  speed dist
1     4    2
2     4   10
3     7    4
tail(data,3) # Reports the 3 last entries
   speed dist
48    24   93
49    24  120
50    25   85

summary() allows one to quickly summarize the distribution of each variable.

summary(data)
     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  

Packages

R Packages

There are a number of packages that are supplied with the R distribution. These are known as “base packages” and they are in the background the second one starts a session in R.

A package is a set of functions and programs that perform specific tasks. By installing packages, we introduce new forms of functionality to the R environment.

R Packages

To use the content in a package, one first needs to install it. One can do this by utilizing the following function: install.packages(). By inserting the name of a specific package, we can connect to an R “mirror” and download the binary of the package.

install.packages("ggplot2")

The version of that package is then saved on your computer and can be called at any time (on or offline).

R Packages

Once installed, it's on the system for good. You can then reference or load the package any time you wish to use a function from it.

There are two functions we can use to load a package: library() and require().

library(ggplot2)

# or 

require(ggplot2)

You must load the package before you can use any function in it.
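
The two loading functions differ in how they fail, which the sketch below illustrates. The package name "notARealPackage123" is invented for the demonstration and is assumed not to be installed.

```r
# require() returns FALSE (with a warning) when the package is missing
ok <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
ok   # FALSE: the script keeps running

# library() stops with an error instead; try() lets us observe that safely
res <- try(library("notARealPackage123", character.only = TRUE), silent = TRUE)
inherits(res, "try-error")   # TRUE
```

Because library() fails loudly, it is often the safer choice at the top of a script: a missing package is caught immediately rather than several lines later.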

R Studio also offers us a way to install packages through the interface.

If we click on the Packages tab and then click Install, we can download a package by typing its name.

We then can load the package from R Studio by clicking the check box beside the packages name.

Sometimes one has a lot of packages running simultaneously.

No problem: we can see what packages are up and running by typing sessionInfo() into the console.

This will tell us everything about the version of R and the packages we are using to run our analysis.

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.3

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.15

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.3.2   stringi_1.1.2 stringr_1.2.0 evaluate_0.10

Remember to Load Your Package!

If you ever try to run a function and you get the following prompt…

Error: could not find function "qplot"

It's likely you forgot to load the package.

require(ggplot2) # First Load the package
qplot() # Then run the function
# Voilà!

Importing/Exporting Data

R allows you to import a large variety of datasets into the environment. However, R's base packages only support a few file formats.

No Fear: there is usually an external package that can do the job!

We are going to focus on three packages to import different data types:

  • readr — an expansive array of functions to read different data types
  • readxl — for excel spreadsheets
  • haven — for SPSS, SAS, and Stata (.dta)

First, we need to install these packages onto our computer.

install.packages("readr") 
install.packages("readxl") 
install.packages("haven") 

And then load them into our current R Session.

require(readr)
require(readxl)
require(haven)

But where is my data exactly?

R doesn't intuitively know where your data is. If the data is in a special folder entitled “my_data”, we have to tell R how to get there.

We can do this three ways:

  1. Set the working directory to that folder
  2. Set the directory via a point-and-click option in R Studio
  3. Establish the path directly to the file

Setting the Working Directory

Every time R boots up, it does so in the same place, unless we tell it to go somewhere else.

We can find out which directory we are in by using the getwd() function.

getwd() # Get the current working directory
/Users/edunford/

Setting the Working Directory

We can then set a new working directory by passing the path to the folder we want to work in, as a string, to the function setwd().

setwd("/Users/edunford/Desktop/my_data")
getwd()
 /Users/edunford/Desktop/my_data/

Setting the WD via R Studio

R Studio also makes setting the working directory really easy.

Click: Session > Set Working Directory > Choose Directory...

This will allow you to set the working directory quickly. The downside is that you have to do it manually every time you return to this project. By writing a script for everything you do, it is easier to replicate (and for others to replicate) your work.

Establishing the Path

Finally, we can also just point directly to the data by outlining the specific path.

Here we are assigning a string containing our path to the object path.

path <- "~/Desktop/my_data/data.csv"

We then load the data using that path.

read.csv(path)
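
A path can also be assembled piece by piece with the base function file.path(), and checked with file.exists() before reading. This is a sketch that reuses the folder names from the example above; whether the file exists depends on your own machine.

```r
path <- file.path("~", "Desktop", "my_data", "data.csv")
path                 # "~/Desktop/my_data/data.csv"
file.exists(path)    # TRUE only if the file is really there
```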

Importing data

Here we will review how to import five separate data types:

  • .dta — Stata file format
  • .csv — comma separated file format
  • .sav — SPSS file format
  • .xlsx — standard Excel file format
  • .Rdata — R's file format

.dta

For all versions of Stata

require(haven)
data <- read_dta(file = "data.dta")


Other packages:

  • readstata13
  • foreign

.csv

read.csv() and read.table() are both base functions in R.

data <- read.csv(file = "data.csv",
                stringsAsFactors = F)
# Or
data <- read.table(file = "data.csv",
                  header = T, 
                  sep=",",
                  stringsAsFactors = F)

These functions have specific arguments that we are referencing:

  • stringsAsFactors = F means that we don't want character vectors in the data.frame to be converted to factors.
  • header = T means the first row of the data contains the column names.
  • sep = "," means that entries are separated by commas.
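
To try these arguments without needing a file of your own, the sketch below writes the built-in cars data to a temporary .csv and reads it back. The object names tmp and back are illustrative.

```r
tmp <- tempfile(fileext = ".csv")               # throwaway file for the demonstration
write.csv(cars, file = tmp, row.names = FALSE)  # export without row numbers

back <- read.csv(file = tmp, stringsAsFactors = FALSE)
dim(back)        # 50  2
colnames(back)   # "speed" "dist"
```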

.csv

The readr package provides a much simpler approach.

require(readr)
data <- read_csv("data.csv")

  • Characters aren't converted to factors.
  • More efficient as \( N \) increases.

.sav

For SPSS and SAS file formats, the haven package offers a simple way of reading in data.

require(haven)
data <- read_sav(file = "data.sav") # SPSS

.xlsx

require(readxl)
data <- read_excel("data.xlsx")

Even select from specific sheets.

excel_sheets("data.xlsx") # list avail. sheets
[1] "Sheet1" "Sheet2"
data <- read_excel("data.xlsx",
                   sheet = 'Sheet1')

.Rdata

.Rdata is the data source inherent to R. It saves and loads objects.

load(file='data.Rdata')

Importing Data Using R Studio

There is also a point-and-click option for importing and exporting data in R.

If we go into the Environment tab and then click Import Dataset, R Studio lets us pick the file we want to import.

Exporting data

Exporting data is the same process in reverse. Instead of reading the data, we want to write a new version of it.

There are a series of functions (each provided by their respective packages) that allow us to do just that.

Each requires that you input the data that you're looking to export and specify the file name and path to tell the computer where the file is going.

Exporting data

write_dta(data, "data.dta")

write_csv(data, "data.csv")

write_sav(data, "data.sav")

write_sas(data, "data.sas7bdat")

write_tsv(data, "data.tab")

# etc.

.Rdata

.Rdata offers two options to save data. We can either save a single data object, or save the entire workspace.

# Save just an object
save(data, file="data.Rdata") 


# Save the entire workspace
save.image(file="workspace.Rdata") 
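
A full round trip can be sketched with a temporary file: save an object, remove it from the environment, and load it back. The object scores is made up for the illustration.

```r
tmp <- tempfile(fileext = ".Rdata")   # throwaway file for the demonstration
scores <- c(90, 85, 77)

save(scores, file = tmp)   # write the object to disk
rm(scores)                 # drop it from the environment
load(file = tmp)           # restores the object under its original name
scores
```

Note that load() brings objects back under the names they were saved with, so there is no need to assign the result.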

Applied Example 1

You'll find an annotated script walking through object creation and data importation in R.

  • objects_and_importing_data.R

Operators

Mathematical Operators

Broadly speaking, R functions as a general calculator that can process a variety of data types.

As we can see, most operators in R are the usual suspects, but some forms are particular to R.

Operation             Calc           Out

Addition              3 + 4          7
Subtraction           3 - 4          -1
Multiplication        3 * 4          12
Division              3 / 4          0.75
Exponentiation        3 ^ 4          81

In the example, we'll walk through a few more operators.
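
A few of the forms particular to R can be sketched here, along with the usual order of operations:

```r
7 %% 2        # modulo (the remainder after division): 1
7 %/% 2       # integer division: 3
(3 + 4) * 2   # parentheses are evaluated first: 14
3 + 4 * 2     # multiplication comes before addition: 11
sqrt(81)      # square root: 9
```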

Mathematical Operators: Functions

There are a range of functions designed to ease mathematical calculations. Some of these functions calculate specific values, such as the natural log or the exponential function (\( e^a \)).

log(4)
[1] 1.386294
exp(5)
[1] 148.4132

Others can be used to find the sum, the mean, or the median of a numeric vector.

x <-  c(1,3,7,100)
sum(x)
[1] 111
mean(x)
[1] 27.75
median(x)
[1] 5

Logical Operators

Boolean statements (i.e. true/false statements) are central to any computer programming environment. Boolean statements allow us to make quick conditional evaluations, which are key to subsetting data.

The following outlines the various types of boolean statements available.

x == y      # equals to
x != y      # does not equal
x >= y      # greater than or equal to
x <= y      # less than or equal to
x > y       # greater than
x < y       # less than

Statements can be combined using "and" (&) or "or" (|) to make more specific queries.

x==1 & y==5 # "and" conditional statements
x==1 | y==5 # "or" conditional statements

Boolean statements can be fed directly into data objects via the brackets method []. This offers a powerful and simple way to subset data.

x <-  c(1,33,100,.6,5,77)
x
[1]   1.0  33.0 100.0   0.6   5.0  77.0
x[x > 30]
[1]  33 100  77
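
The same idea carries over to data frames. As a sketch using the built-in cars data, a boolean statement placed in the row position keeps only the rows where it is TRUE (fast and fast_short are illustrative names).

```r
fast <- cars[cars$speed > 20, ]                        # rows with speed above 20
nrow(fast)                                             # 7

fast_short <- cars[cars$speed > 20 & cars$dist < 60, ] # both conditions must hold
fast_short                                             # one row: speed 23, dist 54
```

The comma after the condition matters: it tells R the condition selects rows and that all columns should be kept.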

There are also a number of base functions that provide useful boolean evaluations. Here are just a few examples…

is.character("hello") # for class
[1] TRUE
all(c(T,F,F)) # are all entries True?
[1] FALSE
identical(1+1,2) # are these entries the same?
[1] TRUE

Finally, boolean statements have a nice property in R. If we convert a boolean statement to a numeric class, TRUE values convert to 1 and FALSE values convert to 0.

This offers us a quick way of generating dichotomous values.

x <- 1:10
x >= 5
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
as.numeric(x >= 5)
 [1] 0 0 0 0 1 1 1 1 1 1
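
Because of this property, functions such as sum() and mean() can be applied to a boolean statement directly. A short sketch:

```r
x <- 1:10
sum(x >= 5)    # how many entries are 5 or more: 6
mean(x >= 5)   # what share of entries are 5 or more: 0.6
```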

Cleaning Text

Dealing with text

We often must deal with problematic text data. Sometimes we need to format responses from a survey so that we can use them in some analysis; other times we are just trying to calculate the date.

Data is often riddled with errors and issues that are costly to resolve. In a sense, this data is dirty. We can't run analysis on dirty data.

A regular expression is a special text string that describes a search pattern. We can extract, clean, and manipulate text using these expressions, which can save hours of manual data cleaning.

Regular Expressions

Consider the following string vector…

countries <- c("Canada","Russia","New Zealand","New Guinea")

Say, from this vector, we want to identify which entries contain the word “New”. The grep() function searches for a specific pattern and returns the positions of the entries that match.

grep(pattern = "New", countries)
[1] 3 4

Here it returns positions 3 and 4, the locations of the matching entries in the vector.

grep()

We can use that position to draw out specific content.

position <- grep(pattern = "New", countries)
countries[position]
[1] "New Zealand" "New Guinea" 

This feature can be useful to identify relevant content in a variable or body of text.
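
A few variations on grep() are sketched below: grepl() returns TRUE/FALSE for every entry, ignore.case = TRUE drops case sensitivity, and value = TRUE returns the matching text rather than its position.

```r
countries <- c("Canada","Russia","New Zealand","New Guinea")

grepl("New", countries)                      # FALSE FALSE  TRUE  TRUE
grep("new", countries, ignore.case = TRUE)   # 3 4
grep("New", countries, value = TRUE)         # "New Zealand" "New Guinea"
```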

gsub()

gsub() can help us actually manipulate the content in a string by identifying a pattern and then replacing it with something new.

countries
[1] "Canada"      "Russia"      "New Zealand" "New Guinea" 
gsub(pattern = "New",replacement = "Old",countries)
[1] "Canada"      "Russia"      "Old Zealand" "Old Guinea" 

Cases

We can also manipulate cases with the tolower() and toupper() functions.

string <- "This Is ReAlLY imPORtant."
tolower(string)
[1] "this is really important."
toupper(string)
[1] "THIS IS REALLY IMPORTANT."

Trimming White Space

We can also get rid of excessive spaces using the trimws() function.

sent <- "        This sentence has a ton of white space             "
sent
[1] "        This sentence has a ton of white space             "
trimws(sent)
[1] "This sentence has a ton of white space"

Generic Patterns

There are generic ways to draw out specific kinds of content from a string: such as digits or punctuation. There are many different types of regular expressions, and we don't have time to review all of them here, but here are a few useful ones.

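  • "\\w" → any word character (letter, digit, or underscore)
  • "\\d" → any digit
  • "\\s" → any whitespace character
  • "*" → zero or more of the preceding item
  • "+" → one or more of the preceding item
  • "[]" → match any single character listed inside the brackets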

Here let's remove problems from the following string using gsub().

trouble <- "This ::String is a 2Problem; 56"
trouble <- gsub("[::]","",trouble)
trouble
[1] "This String is a 2Problem; 56"
trouble <- gsub("\\d*","",trouble)
trouble
[1] "This String is a Problem; "
trouble <- gsub("[;]",".",trouble)
trimws(trouble)
[1] "This String is a Problem."
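
Each generic pattern can also be tried on its own. The strings below are made up for this sketch.

```r
gsub("\\d+", "", "Room 101")                  # drop runs of digits: "Room "
gsub("\\s+", " ", "too    many   spaces")     # squeeze runs of spaces: "too many spaces"
gsub("[aeiou]", "", "banana")                 # drop any listed character: "bnn"
grepl("^\\w+$", c("hello", "hello world"))    # only word characters? TRUE FALSE
```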

We can also target all punctuation with the "[[:punct:]]" regular expression.

dirt <- "C^lean%% this $%&*_@string((!"
dirt
[1] "C^lean%% this $%&*_@string((!"
gsub("[[:punct:]]","",dirt)
[1] "Clean this string"

Joining Text

We can also join or paste text using R. To do so, we'll use the paste() function, which takes the strings to join and a separator (sep=).

sent1 <- "It is nice outside."
sent2 <- "I'll go for a walk."
paste(sent1,sent2,sep = " ")
[1] "It is nice outside. I'll go for a walk."
paste(sent1,sent2,sep = "::::")
[1] "It is nice outside.::::I'll go for a walk."
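
paste() also works element by element on vectors, and paste0() is a base shortcut for an empty separator. A short sketch with made-up labels:

```r
paste0("var", 1:3)            # "var1" "var2" "var3"
paste("Q", 1:3, sep = "_")    # "Q_1" "Q_2" "Q_3"
```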

Collapsing

We can also use paste() to collapse a string vector down into a single string. We do this by specifying the collapse= argument, which is like sep= in that it sets the text placed between the entries.

countries
[1] "Canada"      "Russia"      "New Zealand" "New Guinea" 
paste(countries,collapse=", ")
[1] "Canada, Russia, New Zealand, New Guinea"

collapse= can be used with paste() in useful ways.

sent1 <- "These are the countries that matter:"

countries_sent <- paste(countries,
                        collapse=", ")

paste(sent1, countries_sent,sep=" ")
[1] "These are the countries that matter: Canada, Russia, New Zealand, New Guinea"

Dates and Time

R has a specific Date class. We will use the function as.Date() to coerce a relevant string into a date class.

str <- "2006-04-30"
class(str)
[1] "character"
date_str <- as.Date(str)
class(date_str)
[1] "Date"

Objects of class Date have some nice properties that make analysis and manipulation easy.

date_str
[1] "2006-04-30"
date_str + 30 # date in 30 days
[1] "2006-05-30"
date_str - 3000 # date 3000 days ago
[1] "1998-02-11"

This also allows us to look at the distance between two dates.

date1 <- as.Date("2015-06-07")
date1
[1] "2015-06-07"
date2 <- as.Date("2013-02-14")
date2
[1] "2013-02-14"
date1-date2
Time difference of 843 days
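
The difference can be turned into a plain number or reported in other units. This sketch defines the two dates itself so that it runs on its own; gap is an illustrative name.

```r
date1 <- as.Date("2015-06-07")
date2 <- as.Date("2013-02-14")

gap <- date1 - date2
as.numeric(gap)                           # plain number of days: 843
difftime(date1, date2, units = "weeks")   # the same gap expressed in weeks
```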

Formatting Dates

That said, dates come in many different formats. To let R know that a specific string is a date, we have to tell it the date format.

example <- "February 3, 1987"
as.Date(example)
  Error in charToDate(x) : 
  character string is not in a standard unambiguous format

Once we supply the format, the conversion works.

example <- "February 3, 1987"
as.Date(example, format = "%B %d, %Y")
[1] "1987-02-03"

Formatting dates is similar to regular expressions in that it has a special syntax. In a string (i.e. using “ ”), we specify the exact pattern of the date with all appropriate punctuation and spacing.

The following are the main expressions used in formatting.

  • %d → day as a number
  • %a → abbreviated weekday
  • %A → unabbreviated weekday
  • %m → month as number
  • %b → abbreviated month
  • %B → unabbreviated month
  • %y → 2 digit year
  • %Y → 4 digit year

as.Date("Friday March 13, 2009","%A %B %d, %Y")
[1] "2009-03-13"
as.Date("11/13/14","%m/%d/%y")
[1] "2014-11-13"
as.Date("7th of May 2000","%dth of %B %Y")
[1] "2000-05-07"
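
The same expressions work in the other direction. As a sketch, format() turns a Date back into text in whatever layout we ask for (numeric codes are used here because month and weekday names depend on the computer's language settings).

```r
d <- as.Date("2006-04-30")

format(d, "%d/%m/%Y")          # "30/04/2006"
format(d, "%Y")                # "2006"
as.numeric(format(d, "%m"))    # month as a number: 4
```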

Applied Example 2

Open up an R Studio session and open operators_and_cleaning_text.R.

Here we'll review some of the textual manipulations we learned and we'll explore the powerful stringr package for text manipulation.

Further References

There are some great resources out there to help you climb the R learning curve.

Next Time

With our general understanding of R, we'll cover a comprehensive logic for data manipulation and graphics.

The goal is to leave with a thorough tool kit for data analytics in R.

Finally, we'll cover basic statistical models in R.

See you all for Session 2!


October 12 at 1pm


Please contact me if you have any questions in the meantime. Thanks!

Eric Dunford | Ph.D. Candidate
Department of Government and Politics
University of Maryland, College Park
edunford@umd.edu
www.ericdunford.com